    Large scale biomedical texts classification: a kNN and an ESA-based approaches

    With the large and increasing volume of textual data, automated methods for identifying significant topics to classify textual documents have received growing interest. While many efforts have been made in this direction, it remains a real challenge. Moreover, the issue is even more complex as full texts are not always freely available; annotating these documents using only partial information is therefore promising but remains a very ambitious task. Methods: We propose two classification methods: a k-nearest neighbours (kNN)-based approach and an explicit semantic analysis (ESA)-based approach. Although the kNN-based approach is widely used in text classification, it needs to be improved to perform well in this specific classification problem, which deals with partial information. Compared to existing kNN-based methods, our method uses classical machine learning (ML) algorithms to rank the labels. Additional features are also investigated in order to improve the classifiers' performance, and several learning algorithms are combined with various techniques for fixing the number of relevant topics. ESA, on the other hand, seems promising for this classification task, as it has yielded interesting results on related problems such as semantic relatedness computation between texts and text classification. Unlike existing works, which use ESA to enrich the bag-of-words representation with additional knowledge-based features, our ESA-based method builds a standalone classifier. We also investigate whether the results of this method could serve as a complementary feature of our kNN-based approach. Results: Experimental evaluations performed on large standard annotated datasets, provided by the BioASQ organizers, show that the kNN-based method with the Random Forest learning algorithm performs well compared with current state-of-the-art methods, reaching a competitive F-measure of 0.55, while the ESA-based approach yielded surprisingly modest results. Conclusions: We have proposed simple classification methods suitable for annotating textual documents using only partial information. They are therefore adequate for large-scale multi-label classification, particularly in the biomedical domain. Our work thus contributes to the extraction of relevant information from unstructured documents in order to facilitate their automated processing; it could be used for various purposes, including document indexing and information retrieval. Comment: Journal of Biomedical Semantics, BioMed Central, 201
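
    The following minimal Python sketch illustrates the general kind of kNN label-ranking pipeline described in this abstract: candidate topics are collected from the nearest neighbours of a partially described document and re-ranked by a classical learner such as Random Forest. The data, features and thresholds are illustrative assumptions, not the authors' implementation.

    # Illustrative sketch (not the authors' code) of a kNN + Random Forest
    # label-ranking pipeline for multi-label classification from partial
    # information (e.g., titles/abstracts); data and features are assumptions.
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.neighbors import NearestNeighbors
    from sklearn.ensemble import RandomForestClassifier

    train_texts = ["gene expression profiling of tumour tissue",
                   "rna sequencing of blood samples from patients"]
    train_labels = [{"Humans", "Neoplasms"}, {"Humans", "RNA"}]   # MeSH-like topics

    vec = TfidfVectorizer(stop_words="english")
    X = vec.fit_transform(train_texts)
    knn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(X)

    def candidate_features(query_text, k=2):
        """Collect labels of the k nearest neighbours and build simple
        per-label features: vote count and similarity-weighted score."""
        dist, idx = knn.kneighbors(vec.transform([query_text]), n_neighbors=k)
        rows = []
        for label in sorted(set().union(*(train_labels[i] for i in idx[0]))):
            votes = sum(label in train_labels[i] for i in idx[0])
            weight = sum(1 - d for d, i in zip(dist[0], idx[0]) if label in train_labels[i])
            rows.append((label, [votes, weight]))
        return rows

    # A classifier such as Random Forest, trained on candidate features with
    # keep/reject targets, then ranks the candidates; a threshold or a fixed
    # number of top-ranked topics is finally applied (toy training data below).
    rf = RandomForestClassifier(n_estimators=50, random_state=0)
    rf.fit([[2, 1.8], [1, 0.4], [2, 1.2], [1, 0.1]], [1, 0, 1, 0])
    rows = candidate_features("expression profiling of tumour rna")
    scores = rf.predict_proba([feats for _, feats in rows])[:, 1]
    print(sorted(zip((lab for lab, _ in rows), scores), key=lambda t: -t[1]))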

    PLoS One

    MOTIVATION: The recent revolution in new sequencing technologies, as part of the continuous process of adopting innovative protocols, has strongly impacted the interpretation of relations between phenotype and genotype. Thus, understanding the resulting gene sets has become a bottleneck that needs to be addressed. Automatic methods have been proposed to facilitate the interpretation of gene sets. While statistical functional enrichment analyses are currently well known, they tend to focus on well-known genes and to ignore new information from less-studied genes. To address such issues, applying semantic similarity measures is logical if the knowledge source used to annotate the gene sets is hierarchically structured. In this work, we propose a new method for analyzing the impact of different semantic similarity measures on gene set annotations. RESULTS: We evaluated the impact of each measure by taking into consideration the following two features, which correspond to relevant criteria for a "good" synthetic gene set annotation: (i) the number of annotation terms has to be drastically reduced while the representative terms are retained when annotating the gene set, and (ii) the number of genes described by the selected terms should be as large as possible. We thus analyzed nine semantic similarity measures to identify the best possible compromise between both features while maintaining a sufficient level of detail. Using Gene Ontology to annotate the gene sets, we obtained better results with node-based measures that use the terms' characteristics than with measures based on the edges that link the terms. The annotation of the gene sets achieved with the node-based measures did not exhibit major differences regardless of the characteristics of the terms used.
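
    As an illustration of what a node-based measure relies on, the short sketch below computes Resnik similarity, which scores two Gene Ontology terms by the information content of their most informative common ancestor; the toy hierarchy and annotation counts are assumptions, not the paper's data or code.

    # Minimal sketch with toy data (not the paper's implementation) of one
    # node-based measure: Resnik similarity scores two GO terms by the
    # information content (IC) of their most informative common ancestor.
    import math

    parents = {"GO:B": {"GO:A"}, "GO:C": {"GO:A"}, "GO:D": {"GO:B", "GO:C"}}  # is-a edges

    def ancestors(term):
        """Return the term together with all of its ancestors in the toy DAG."""
        seen, stack = {term}, [term]
        while stack:
            for p in parents.get(stack.pop(), ()):
                if p not in seen:
                    seen.add(p)
                    stack.append(p)
        return seen

    # annotation counts per term (already propagated to ancestors here)
    counts = {"GO:A": 100, "GO:B": 40, "GO:C": 30, "GO:D": 5}
    ic = {t: -math.log(c / counts["GO:A"]) for t, c in counts.items()}

    def resnik(t1, t2):
        """IC of the most informative common ancestor of t1 and t2."""
        common = ancestors(t1) & ancestors(t2)
        return max(ic[t] for t in common) if common else 0.0

    print(resnik("GO:B", "GO:C"))   # 0.0: their only common ancestor is the root GO:A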

    Yearb Med Inform

    OBJECTIVES: To introduce the 2022 International Medical Informatics Association (IMIA) Yearbook by the editors. METHODS: The editorial provides an introduction and overview to the 2022 IMIA Yearbook, whose special topic is "Inclusive Digital Health: Addressing Equity, Literacy, and Bias for Resilient Health Systems". The special topic, survey papers, section editor synopses and some best papers are discussed. Changes to the sections and to the Yearbook Editorial Committee are also described. RESULTS: As shown in the previous edition, health informatics in the context of a global pandemic has led to the development of ways to collect, standardize, disseminate and reuse data worldwide. The Corona Virus Disease 2019 (COVID-19) pandemic has demonstrated the need for timely, reliable, open, and globally available information to support decision making. It has also highlighted the need to address social inequities and disparities in access to care across communities. This edition of the Yearbook acknowledges the fact that much work has been done in recent years to study health equity in the various fields of health informatics research. CONCLUSION: There is a strong desire to better account for disparities between populations, in particular to avoid biases being induced in Artificial Intelligence algorithms. Telemedicine and m-health must be more inclusive for people with disabilities or those living in isolated geographical areas.

    Mapping data elements to terminological resources for integrating biomedical data sources

    BACKGROUND: Data integration is a crucial task in the biomedical domain, and integrating data sources is one approach to it. Data elements (DEs) in particular play an important role in data integration. We combine schema- and instance-based approaches to mapping DEs to terminological resources in order to facilitate the integration of data sources. METHODS: We extracted DEs from eleven disparate biomedical sources. We compared these DEs to concepts and/or terms in biomedical controlled vocabularies and to reference DEs. We also exploited DE values to disambiguate underspecified DEs and to identify additional mappings. RESULTS: 82.5% of the 474 DEs studied are mapped to entries of a terminological resource, and 74.7% of the whole set can be associated with reference DEs. Only 6.6% of the DEs had values that could be semantically typed. CONCLUSION: Our study suggests that the integration of biomedical sources can be achieved automatically with limited precision and is largely facilitated by mapping DEs to terminological resources.
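
    The short sketch below illustrates, on assumed toy data, how a schema-based match on DE names can be combined with an instance-based fallback that inspects DE values; the vocabulary, identifiers and value patterns are hypothetical and not the study's actual pipeline.

    # Illustrative sketch (toy vocabulary; not the study's pipeline) combining a
    # schema-based match on data element (DE) names with an instance-based
    # fallback that uses DE values to type an underspecified DE.
    import re

    vocabulary = {              # tiny stand-in for a terminological resource
        "body weight": "VOC:0001",
        "sex": "VOC:0002",
        "tumor size": "VOC:0003",
    }

    def normalize(name):
        """Lowercase and strip punctuation/extra whitespace before comparison."""
        name = re.sub(r"[^a-z0-9 ]", " ", name.lower())
        return re.sub(r"\s+", " ", name).strip()

    def map_data_element(name, values=()):
        """Schema-based exact match on the DE name; if unmatched, inspect the
        DE values to suggest a mapping (here: M/F values imply 'sex')."""
        key = normalize(name)
        if key in vocabulary:
            return "schema", vocabulary[key]
        if values and all(v.lower() in {"m", "f", "male", "female"} for v in values):
            return "instance", vocabulary["sex"]
        return "unmapped", None

    print(map_data_element("Body Weight"))                 # ('schema', 'VOC:0001')
    print(map_data_element("gender", values=["M", "F"]))   # ('instance', 'VOC:0002')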

    Visualizing Food-Drug Interactions in the Theriaque Database

    This paper presents a prototype for the visualization of food-drug interactions implemented in the MIAM project, whose objective is to develop methods for extracting and representing these interactions and to make them available in the Thériaque database. The prototype provides users with a graphical visualization showing the drug and food hierarchies facing each other, with links between them representing the existing interactions, as well as additional details about them, including the number of articles reporting each interaction. The prototype is interactive in the following ways: hierarchies can easily be folded and unfolded, a filter can be applied to view only certain types of interactions, and details about a given interaction are displayed when the mouse is moved over the corresponding link. Future work includes proposing a version more suitable for non-health-professional users and representing the food hierarchy based on a reference classification.

    Stud Health Technol Inform

    Clinical information in electronic health records (EHRs) is mostly unstructured. With the ever-increasing amount of information in patients' EHRs, manual extraction of clinical information for data reuse can be tedious and time-consuming without dedicated tools. In this paper, we present SmartCRF, a prototype to visualize, search, and ease the extraction and structuring of information from EHRs stored in an i2b2 data warehouse.

    OBO Foundry food ontology interconnectivity

    Since its creation in 2016, the FoodOn ontology has become an interconnected partner in various academic and government inter-agency ontology work spanning the agricultural and public health domains. This paper examines existing and potential data interoperability capabilities arising from FoodOn and partner food-related ontologies belonging to the encyclopedic Open Biological and Biomedical Ontology Foundry (OBO) vocabulary platform, and how research organizations and industry might use them for their own operations or for data exchange. Projects are seeking standardized vocabulary across all direct food supply activities (agricultural production, harvesting, preparation, food processing, marketing, distribution and consumption), as well as, indirectly, within health, economic, food security and sustainability analysis and reporting tools. Satisfying this demand and providing such data requires establishing domain-specific ontologies whose curators coordinate closely to produce recommended patterns for food system vocabulary.

    Development of a fixed module repertoire for the analysis and interpretation of blood transcriptome data.

    As the capacity for generating large-scale molecular profiling data continues to grow, the ability to extract meaningful biological knowledge from it remains a limitation. Here, we describe the development of a new fixed repertoire of transcriptional modules, BloodGen3, that is designed to serve as a stable reusable framework for the analysis and interpretation of blood transcriptome data. The construction of this repertoire is based on co-clustering patterns observed across sixteen immunological and physiological states encompassing 985 blood transcriptome profiles. Interpretation is supported by customized resources, including module-level analysis workflows, fingerprint grid plot visualizations, interactive web applications and an extensive annotation framework comprising functional profiling reports and reference transcriptional profiles. Taken together, this well-characterized and well-supported transcriptional module repertoire can be employed for the interpretation and benchmarking of blood transcriptome profiles within and across patient cohorts. Blood transcriptome fingerprints for the 16 reference cohorts can be accessed interactively via: https://drinchai.shinyapps.io/BloodGen3Module/
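
    The sketch below illustrates, on synthetic data, the general principle behind module-level fingerprints (the percentage of a module's genes that are up- or down-regulated in cases versus controls); it is an assumption-laden illustration and not the BloodGen3Module package or its API.

    # Hedged sketch with synthetic data (not the BloodGen3Module package itself)
    # of the general idea behind module-level fingerprints: for each fixed gene
    # module, report the percentage of member genes up- or down-regulated in
    # cases relative to controls.
    import numpy as np

    rng = np.random.default_rng(0)
    genes = [f"g{i}" for i in range(12)]
    modules = {"M1.1": genes[:6], "M1.2": genes[6:]}        # fixed module repertoire
    controls = rng.normal(1.0, 0.1, size=(8, len(genes)))   # 8 control samples
    cases = rng.normal(1.0, 0.1, size=(8, len(genes)))      # 8 case samples
    cases[:, :6] *= 2.0                                      # simulate up-regulation of M1.1 genes

    def module_fingerprint(fold=1.5):
        """Percent of each module's genes whose mean case/control fold change
        exceeds the cutoff (up) or falls below its reciprocal (down)."""
        idx = {g: j for j, g in enumerate(genes)}
        fc = cases.mean(axis=0) / controls.mean(axis=0)
        out = {}
        for name, members in modules.items():
            cols = [idx[g] for g in members]
            up = 100 * np.mean(fc[cols] >= fold)
            down = 100 * np.mean(fc[cols] <= 1 / fold)
            out[name] = (round(up, 1), round(down, 1))
        return out

    print(module_fingerprint())   # e.g. {'M1.1': (100.0, 0.0), 'M1.2': (0.0, 0.0)}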